We study a class of conditional independence models for discrete data with the property
that one or more log-linear interactions are defined within two different marginal
distributions and then constrained to 0; all the conditional independence models which are
known to be non smooth belong to this class. We introduce a new marginal log-linear
parameterization and show that smoothness may be restored by restricting one or more
independence statements to hold conditionally to a restricted subset of the configurations
of the conditioning variables. Our results are based on a specific reconstruction algorithm
from log-linear parameters to probabilities and fixed point theory. Several examples are
examined and a general rule for determining the implied conditional independence
restrictions is outlined.
parameterizations.



1. Introduction 



Conditional independence models for discrete data are determined by a set of constraints
on log-linear interactions defined within different marginal distributions of a
contingency table. The family of hierarchical and complete marginal log-linear
parameterizations (HCMP for short) introduced by Bergsma and Rudas [4] provides a general
framework for combining log-linear constraints defined on a collection of marginal
distributions into an overall joint distribution. Methods for determining whether and how a
conditional independence model may be translated into an HCMP have been studied by
Rudas et al. [13] and Forcina et al. [9] among others; the existence of an HCMP is a
sufficient condition for the model to be smooth.

On the other hand, it is known that no HCMP exists when a model imposes constraints
on the same log-linear interaction defined in two different marginals. It has been shown
[4, Theorem 3] that, when the same interaction is defined in two different marginals, the
Jacobian of the mapping from log-linear parameters to probabilities is singular for the
uniform distribution. Though, formally, this does not imply that the model itself has
singularities, all known models with singularities correspond to cases where no HCMP
exists because one or more interactions are constrained more than once. In this paper
we study the class of conditional independence models where the same interaction is
constrained in two or more marginal distributions and we show, essentially, that any such
model is non smooth but can be turned into a smooth model by restricting it to a suitable
context specific conditional independence model.

Preprint submitted to Elsevier, October 31, 2012

Following Bergsma and Rudas [4], we may assume, without loss of generality, that 
the marginal distributions of interest have been arranged in a non decreasing order and 
that they will be reconstructed one at a time starting from the smallest. Because the full 
joint distribution is simply the last marginal in this list, we need only to consider how to 
determine a given marginal distribution when one or more log-linear interactions to be 
constrained have already been defined and/or constrained in a previous marginal. A useful 
tool for reconstructing marginal distributions in a sequence is the mixed parameterization 
[e.g., 2] by which we may combine the marginal probabilities from previous marginals 
with the log-linear interactions defined in the marginal distribution under consideration. 
Because the mapping produced by the mixed parameterization is one to one and smooth, 
the question of whether a model is smooth up to a given marginal, is equivalent to the 
question whether an algorithm based on the mixed parameterization exists and converges. 
By using results from the theory of fixed point algorithms, we study the Jacobian of a
new reconstruction algorithm that allows certain log-linear interactions to be redefined
and show that the algorithm may (i) converge, in which case the model is smooth; (ii) remain
at the starting point irrespective of the starting value, implying that the resulting
distribution is not uniquely determined by the log-linear parameters; or (iii) simply fail to
converge. A formal proof of these properties is derived under complete independence and we
provide substantial evidence to support the conjecture that our results hold in the general case.

The results derived in this paper help clarify which interaction parameters may be
redefined and which other interactions should be omitted as a replacement. In particular
we show that smoothness is restored only when a specific subset of other interactions
is omitted; these interactions have the property that, when they are missing, and thus
unconstrained, the conditional independence of interest holds only on a subset of the
configurations of the conditioning variables. Log-linear models which allow context
specific conditional independences have been studied in detail by Højsgaard [10] who also
derives a Markov property for undirected graphs involving context specific conditional
independencies. A special case of the results derived here was considered by Roverato
et al. [12].

In section 2, we introduce the basic notations, define marginal log-linear interactions 
and review the properties of the mixed parameterization. In section 3, after presenting a 
set of motivating examples, we introduce a new algorithm for reconstructing a marginal 
distribution when interactions defined in previous marginals have to be constrained again 
and we analyze its convergence properties. In section 4 we study the consequences on the 
original conditional independence statements of omitting constraints on a specific subset 
of higher order interactions and show that this results in context specific restrictions. 

2. Notations and preliminary results 

We study the joint distribution of $d$ discrete random variables where $X_j$, $j = 1, \ldots, d$,
takes values in $(0, \ldots, r_j)$. For conciseness, we denote variables by their indices and
use capitals to denote non-empty subsets of $V = \{1, \ldots, d\}$; such subsets will determine
the variables involved either in a marginal distribution or in an interaction term. The
collection of all non-empty subsets of a set $M \subseteq V$ will be denoted by $\mathcal{P}(M)$. In the
following we write $i_1 i_2 \ldots i_k$ as a shorthand notation for $\{i_1, i_2, \ldots, i_k\}$. For a given
$M \subseteq V$, the marginal distribution in $M$ is determined by the cell probabilities
$p_M(x_M) = P(X_j = x_j, \forall j \in M)$. We introduce a shorthand notation that allows us to specify
the values of selected subsets of the arguments in a marginal probability and in the log-linear
interactions to be defined below. Let $J \subseteq I \subseteq M$, then $p_M(x_J, x_{I \setminus J}, x_{M \setminus I})$ denotes the
marginal probability where $x_J$ is the value of $X_h$, $h \in J$, $x_{I \setminus J}$ the value of $X_h$, $h \in I \setminus J$,
and $x_{M \setminus I}$ the values of $X_h$, $h \in M \setminus I$. We will also write $0_{I \setminus J}$ to state that
$X_h = 0$, $\forall h \in I \setminus J$.

2.1. Marginal and conditional log-linear interactions 

Though there are many different ways of coding marginal log-linear parameters, pa- 
rameters defined by different codings are linearly related; thus there is no loss of generality 
in using the reference category coding, where comparisons are with respect to the category 
taken as reference, usually the first. 

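Definition 1. A reference category log-linear interaction $I$ within $M$ is defined by the
following expression

$$\eta_{I;M}(x_I \mid x_{M \setminus I}) = \sum_{J \subseteq I} (-1)^{|I \setminus J|} \log p_M(x_J, 0_{I \setminus J}, x_{M \setminus I}), \tag{1}$$

where, $\forall i \in I$, $x_i > 0$.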

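Example 1. The logit of $X_i$ at $x_i$ computed within $M$ is

$$\eta_{i;M}(x_i \mid x_{M \setminus i}) = \log p_M(x_i, x_{M \setminus i}) - \log p_M(0_i, x_{M \setminus i}), \quad x_i > 0,$$

and the log-odds ratio for $X_i = x_i$, $X_j = x_j$ is

$$\eta_{H;M}(x_H \mid x_{M \setminus H}) = \log p_M(x_i, x_j, x_{M \setminus H}) - \log p_M(0_i, x_j, x_{M \setminus H}) - \log p_M(x_i, 0_j, x_{M \setminus H}) + \log p_M(0_i, 0_j, x_{M \setminus H}),$$

where $H = i \cup j$.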

It may be easily verified that, given $h \in M \setminus I$ and $H = I \cup h$, (1) implies the following
recursive relation

$$\eta_{H;M}(x_H \mid x_{M \setminus H}) = \eta_{I;M}(x_I \mid x_h, x_{M \setminus H}) - \eta_{I;M}(x_I \mid 0_h, x_{M \setminus H}); \tag{2}$$

this indicates that interactions of higher order may be constructed by a sequence of first
order differences starting from logits.

Whenever $M \setminus I$ is not empty, marginal log-linear interactions depend on the value of
the remaining variables. Because (1) is a contrast of logarithms of marginal probabilities,
it can be easily verified that $\eta_{I;M}(x_I \mid x_{M \setminus I})$ is the log-linear interaction $I$ in the marginal
distribution $M$ conditionally on $X_h = x_h$, $\forall h \in M \setminus I$. Clearly, within the full collection
of marginal log-linear interaction parameters conditional on the configurations of the
remaining variables, there is a substantial amount of redundancy. Below we show that
these parameters are linearly related and that they can all be written in terms of the
subset where the conditioning variables are all fixed at their reference category; this
subset contains non redundant elements.

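For a given $\eta_{I;M}(x_I \mid x_{M \setminus I})$ let $h \in M \setminus I$ and $H = I \cup h$, then (2) may be used to obtain

$$\eta_{I;M}(x_I \mid x_{M \setminus I}) = \eta_{I;M}(x_I \mid 0_h, x_{M \setminus H}) + \eta_{H;M}(x_H \mid x_{M \setminus H}).$$

Repeated use of the relation above leads to the following expansion

$$\eta_{I;M}(x_I \mid x_{M \setminus I}) = \sum_{I \subseteq H \subseteq M} \eta_{H;M}(x_H \mid 0_{M \setminus H}). \tag{3}$$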

The above equation shows that any marginal log-linear interaction may be written as a
linear function of all possible higher order interactions conditional on the initial category
of the remaining variables within the given marginal. For simplicity, in the following,
we write $\eta_{I;M}(x_I)$ as a shorthand for $\eta_{I;M}(x_I \mid 0_{M \setminus I})$. An alternative way of removing
conditioning variables, which has been applied to interactions defined as contrasts of
averages of logarithms of probabilities, but could be applied to any type of interactions, is
to average across the set of all possible configurations of the conditioning variables $x_{M \setminus I}$.
The log-linear interactions used by Bergsma and Rudas [4], among others, are defined in
this way; Lemma 8 in the Appendix shows that these interactions are linear functions of
all the interactions $\eta_{H;M}(x_H)$ for $H \supseteq I$.

Example 2. Suppose that $M = I \cup h \cup k$, then

$$\eta_{I;M}(x_I \mid x_h, x_k) = \eta_{I;M}(x_I) + \eta_{I \cup h;M}(x_I, x_h) + \eta_{I \cup k;M}(x_I, x_k) + \eta_{I \cup h \cup k;M}(x_I, x_h, x_k).$$
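
Equations (1), (2) and (3) can be checked numerically on a small table. The sketch below is only an illustration: the 2 x 3 x 2 table is randomly generated and the function names `eta` and `expansion` are hypothetical helpers, not part of the paper.

```python
import itertools
import numpy as np

def eta(p, I, x):
    """Reference category interaction of equation (1).

    p is a table of cell probabilities, I a tuple of axes of p and x a full cell
    (one category per axis) with x[i] > 0 for every i in I.
    """
    total = 0.0
    for k in range(len(I) + 1):
        for J in itertools.combinations(I, k):
            # variables in I \ J are set to the reference category 0
            cell = tuple(0 if (i in I and i not in J) else x[i] for i in range(p.ndim))
            total += (-1) ** (len(I) - k) * np.log(p[cell])
    return total

def expansion(p, I, x):
    """Right-hand side of equation (3): sum over all H with I within H within M."""
    rest = [h for h in range(p.ndim) if h not in I]
    total = 0.0
    for k in range(len(rest) + 1):
        for extra in itertools.combinations(rest, k):
            H = tuple(sorted(I + extra))
            # variables outside H are conditioned at their reference category
            cell = tuple(x[j] if j in H else 0 for j in range(p.ndim))
            total += eta(p, H, cell)
    return total

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(12)).reshape(2, 3, 2)  # hypothetical 2 x 3 x 2 table
x = (1, 2, 1)

logit = eta(p, (0,), x)                             # logit of the first variable
lhs_2 = eta(p, (0, 1), x)                           # left-hand side of (2)
rhs_2 = eta(p, (0,), x) - eta(p, (0,), (1, 0, 1))   # right-hand side of (2)
rhs_3 = expansion(p, (0,), x)                       # right-hand side of (3)
```

Here `lhs_2` and `rhs_2` agree up to rounding, and `logit` agrees with `rhs_3`, as the recursion and the expansion require.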

For any $I \in \mathcal{P}(M)$, it is convenient to arrange the log-linear interactions $\eta_{I;M}(x_I)$ into
the vector $\eta(I, M)$ with elements in lexicographic order of $x_I$; this vector may be written
as

$$\eta(I, M) = C(I, M) \log p(M), \tag{4}$$

where $C(I, M) = \bigotimes_{j \in M} C_j$ and $C_j = (-1_{r_j} \;\; I_{r_j})$ if $j \in I$ and $C_j = (1, 0'_{r_j})$ otherwise.

Let also $\eta(M) = C(M) \log p(M)$ denote the vector obtained by stacking the $\eta(I, M)$
components one below the other in lexicographic order relative to $I \in \mathcal{P}(M)$. It is
well known that under multinomial sampling, $\eta(M)$ constitutes a vector of variation
independent canonical parameters for $p(M)$. Let $G(I, M) = \bigotimes_{j \in M} G_j$, where $G_j$ is an
identity matrix of order $r_j + 1$ without the first column if $j \in I$ and $1_{r_j+1}$ otherwise.
Let $G(M)$ be the matrix whose columns are given by the $G(I, M)$ matrices arranged side
by side in lexicographic order. It is easily verified that $G(M)$ is the right inverse
of $C(M)$; this implies the reconstruction formula

$$\log p(M) = G(M)\eta(M) - 1 \log\{1' \exp[G(M)\eta(M)]\}. \tag{5}$$
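
As a check on the constructions above, the following sketch builds $C(M)$ and $G(M)$ for three binary variables, where $C_j = (-1, 1)$ or $(1, 0)$ and $G_j = (0, 1)'$ or $(1, 1)'$, and applies the reconstruction formula (5). The choice of three binary variables, the random table and the ordering of the subsets (by size, then lexicographic) are assumptions made only for this example.

```python
import itertools
import numpy as np

d = 3  # three binary variables (r_j = 1), an illustrative choice
subsets = [s for k in range(1, d + 1) for s in itertools.combinations(range(d), k)]

def kron_all(factors):
    """Kronecker product of a list of vectors, first factor varying slowest."""
    out = np.ones(1)
    for f in factors:
        out = np.kron(out, f)
    return out

def c_row(I):
    # C_j = (-1, 1) if j is in I, (1, 0) otherwise
    return kron_all([np.array([-1.0, 1.0]) if j in I else np.array([1.0, 0.0])
                     for j in range(d)])

def g_col(I):
    # G_j = (0, 1)' if j is in I, (1, 1)' otherwise
    return kron_all([np.array([0.0, 1.0]) if j in I else np.ones(2)
                     for j in range(d)])

C = np.vstack([c_row(I) for I in subsets])        # (2^d - 1) x 2^d
G = np.column_stack([g_col(I) for I in subsets])  # 2^d x (2^d - 1)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2 ** d))                # hypothetical joint distribution
eta = C @ np.log(p)                               # stacked interactions
lin = G @ eta
log_p = lin - np.log(np.sum(np.exp(lin)))         # equation (5)
```

The product `C @ G` is the identity matrix of order 7 and `np.exp(log_p)` recovers `p`, which is what the right-inverse property implies.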
2.2. The mixed parameterization

Within the distribution in $M$, the vector of mean parameters $\mu_{\mathcal{P}(M)}$ [2, p. 121] is the
expected value of the sufficient statistics for $\eta(M)$ in a sample of size 1 and equals

$$\mu_{\mathcal{P}(M)} = G(M)' p(M);$$

there is a diffeomorphism between $\mu_{\mathcal{P}(M)}$ and $\eta(M)$ [2, p. 121]. Because each block
of rows $C(I, M)$ in $C(M)$ corresponds to a block of columns $G(I, M)$ in $G(M)$, we
may define $\mu_I = G(I, M)' p(M)$ to be the collection of mean parameters for a given
interaction. It is worth noting that, though mean parameters, like canonical parameters,
are associated to interactions $I \in \mathcal{P}(M)$, $\mu_I$ may be defined in any marginal such that
$I \subseteq M$. Having coded the canonical parameters as contrasts with respect to the initial
category, the corresponding mean parameters are simply marginal probabilities.
We recall a definition and a few results which are relevant in the following.
We recall a definition and a few results which are relevant in the following. 

Definition 2. For an arbitrary margin $M$, let $(\mathcal{U}, \mathcal{V})$ be a partition of $\mathcal{P}(M)$; the pair of
vectors $[\eta_{\mathcal{U};M}, \mu_{\mathcal{V}}]$, where $\eta_{\mathcal{U};M} = (\eta(I, M), I \in \mathcal{U})$ is composed of canonical parameters,
and $\mu_{\mathcal{V}} = (\mu_I, I \in \mathcal{V})$ is composed of mean parameters, constitutes a mixed
parameterization of the marginal distribution $p(M)$.

In the following, to be short, we will often refer to the log-linear parameters
$\eta_{\mathcal{U};M} = (\eta(I, M), I \in \mathcal{U})$ as log-linear interactions in $\mathcal{U}$ or collection $\mathcal{U}$ of log-linear parameters.

Lemma 1. For any mixed parameterization, there is a diffeomorphism between the vector
of mean parameters $\mu_{\mathcal{P}(M)}$ and the pair of vectors $[\eta_{\mathcal{U};M}, \mu_{\mathcal{V}}]$; in addition, the two
components are variation independent.

Proof. See [2, p. 121-122] □

The numerical algorithm for reconstructing $p(M)$ from $[\eta_{\mathcal{U};M}, \mu_{\mathcal{V}}]$ given by Forcina
[8] is a faster alternative to the usual IPF algorithm.
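
To make the mixed parameterization concrete, the sketch below reconstructs a distribution from canonical parameters on one part of the partition and mean parameters on the other. It is not the algorithm of Forcina [8]: it is a plain damped Newton iteration written for this illustration, restricted to binary variables, and the names `probs` and `mixed_to_probs` are hypothetical.

```python
import itertools
import numpy as np

d = 3  # three binary variables, an illustrative choice
subsets = [s for k in range(1, d + 1) for s in itertools.combinations(range(d), k)]

def g_col(I):
    col = np.ones(1)
    for j in range(d):
        col = np.kron(col, np.array([0.0, 1.0]) if j in I else np.ones(2))
    return col

G = np.column_stack([g_col(I) for I in subsets])

def probs(eta):
    """Cell probabilities for canonical parameters eta (normalised exp of G eta)."""
    lin = G @ eta
    w = np.exp(lin - lin.max())
    return w / w.sum()

def mixed_to_probs(mean_idx, mu, eta_fixed, tol=1e-13, max_iter=200):
    """Find p whose mean parameters on mean_idx equal mu and whose canonical
    parameters elsewhere equal the corresponding entries of eta_fixed."""
    eta = eta_fixed.copy()
    eta[mean_idx] = 0.0                       # free canonical parameters, started at 0
    GV = G[:, mean_idx]
    for _ in range(max_iter):
        p = probs(eta)
        resid = mu - GV.T @ p
        err = np.linalg.norm(resid)
        if err < tol:
            break
        omega = np.diag(p) - np.outer(p, p)
        step = np.linalg.solve(GV.T @ omega @ GV, resid)  # Newton direction
        t = 1.0
        while t > 1e-10:                      # halve the step until the residual shrinks
            trial = eta.copy()
            trial[mean_idx] += t * step
            if np.linalg.norm(mu - GV.T @ probs(trial)) < err:
                eta = trial
                break
            t /= 2
        else:
            break                             # no improving step was found
    return probs(eta)

rng = np.random.default_rng(0)
eta_true = 0.5 * rng.normal(size=len(subsets))    # hypothetical canonical parameters
p_true = probs(eta_true)
mu_true = G.T @ p_true                            # all mean parameters
mean_idx = np.array([i for i, I in enumerate(subsets) if len(I) == 1])
p_rec = mixed_to_probs(mean_idx, mu_true[mean_idx], eta_true)
```

In this example the univariate margins play the role of the mean parameters and all interactions of order two or more are canonical; `p_rec` should coincide with `p_true`, in line with Lemma 1.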

The mixed parameterization is a powerful tool for reconstructing a joint distribution
from marginal log-linear parameters because one can process one marginal distribution
at a time by combining the log-linear parameters defined within that distribution with
the mean parameters, or, equivalently, marginal probabilities, available from marginal
distributions reconstructed in previous steps. As long as these two sets of interactions are
a partition of $\mathcal{P}(M)$, the basic argument used by Bartolucci et al. [3] implies that any
model defined by linear constraints on the marginal log-linear parameters constitutes a
curved exponential family and thus is smooth.

3. The LM reconstruction algorithm 

In this section we investigate the properties of conditional independence models which
require imposing non trivial constraints on the same log-linear interactions defined in
two or more marginal distributions. We may suppose, without loss of generality, that the
marginals of interest are arranged in non decreasing order and that they will be processed
one at a time, starting from the first one. In this way, at each step in the reconstruction of
the joint distribution from its marginal log-linear parameters, we need only be concerned
with the marginal at hand and examine whether, by use of the mixed parameterization, we
may combine the mean parameters from previous marginals with the log-linear parameters
which are either available or need to be constrained in the marginal under consideration.
An algorithm for doing this is presented and its convergence properties investigated.

3.1. Motivating examples 

We now present a set of examples which will highlight different features of the kind of 
models we are going to consider. Each model is made of two parts: (i) a list of conditional 
independencies which have been accommodated, somehow, in previous marginals (ii) an 
additional conditional independence to be imposed in the current marginal M. We start 
with a couple of elementary models: 

Example 3. Suppose that, having assumed that $1 \perp\!\!\!\perp 2 \mid 3$ in the marginal 123, in the
marginal $M = 1234$ we want also $1 \perp\!\!\!\perp 2 \mid (3, 4)$. Here we need to constrain again the
$\{12, 123\}$ interactions; in the binary case, Evans [7] has shown that the model has
singularities.

Example 4. Suppose that, having assumed that $1 \perp\!\!\!\perp (2, 4)$ in the marginal 124, we want
also $2 \perp\!\!\!\perp 4 \mid (1, 3)$. Here, in addition to the 124 interaction which has already been
constrained in 124, we need to constrain 24 which was defined in the previous marginal; in
the binary case, Drton [5] has shown that the model has singularities.

The next example is a little more complex:

Example 5. Having assumed that $1 \perp\!\!\!\perp 2 \mid 3$ and $1 \perp\!\!\!\perp 3 \mid 4$ we also want $1 \perp\!\!\!\perp (2, 3) \mid (4, 5)$;
here the list of interactions to be constrained again is given by $\{12, 123, 13, 134\}$.

The following examples are different because the collection of interactions that have
already been defined in previous marginals is too large to be redefined again in $M$:

Example 6. Suppose that, having set $1 \perp\!\!\!\perp 2 \mid (3, 4)$ and $1 \perp\!\!\!\perp 2 \mid (3, 5)$ we also want
$1 \perp\!\!\!\perp 2 \mid (3, 4, 5)$; here the collection of interactions that have already been defined and that
have to be constrained again is $\{12, 123, 124, 125, 1234, 1235\}$.

Example 7. Suppose that, having set $1 \perp\!\!\!\perp 2 \mid (3, 4)$, $1 \perp\!\!\!\perp 2 \mid (3, 5)$ and $1 \perp\!\!\!\perp 2 \mid (4, 5)$ we
also want $1 \perp\!\!\!\perp 2 \mid (3, 4, 5)$; here all the interactions in the ascending class from 12 to
$M = 12345$, except $M$ itself, have to be constrained again.

3.2. Setting up the framework 

Let $M$ denote the current marginal, $\mathcal{V}$ the collection of interactions defined in previous
marginals which belong to $\mathcal{P}(M)$ and $\mathcal{C} = \mathcal{P}(M) \setminus \mathcal{V}$. Let also $\mathcal{A}$ be the collection of
interactions to be constrained in $M$ according to the last conditional independence
statement; whenever $\mathcal{V} \cap \mathcal{A} \neq \emptyset$, we are trying to constrain again the corresponding log-linear
interaction. Though we would like to redefine and constrain in $M$ all the interactions in
$\mathcal{V} \cap \mathcal{A}$, we shall see that this is not always possible; denote by $\mathcal{I} \subseteq (\mathcal{V} \cap \mathcal{A})$ the actual
collection which we redefine in $M$ and $\mathcal{R} = \mathcal{V} \setminus \mathcal{I}$ the remaining interactions.

Because the mean parameters in $\mathcal{I} \cup \mathcal{R}$ together with the log-linear parameters in $\mathcal{C}$
constitute a mixed parameterization of $p(M)$, these parameters determine uniquely the
value of the log-linear parameters in $\mathcal{I}$ to be redefined within $M$; thus they cannot be
constrained again, unless we remove from $\mathcal{C}$ a collection, say $\mathcal{H}$, of log-linear interactions
with exactly the same number of parameters as the collection $\mathcal{I}$; below we investigate
whether such an atypical parameterization may provide a smooth mapping. We shall
see that the two sets $\mathcal{I}$, $\mathcal{H}$ must be chosen carefully and satisfy a set of conditions which
establish a close relation between them.

Example 8. Consider again example 6, here $\mathcal{V} = \mathcal{P}(1234) \cup \mathcal{P}(1235)$,
$\mathcal{A} = \{12, 123, 124, 125, 1234, 1235, 1245, 12345\}$ and $\mathcal{A} \cap \mathcal{V} = \{12, 123, 124, 125, 1234, 1235\}$; as we
shall see, not all the elements of this collection can be redefined in $M$, the most we can achieve is to
set $\mathcal{I} = \{12, 123\}$ and $\mathcal{H} = \{1245, 12345\}$ where $X_4$ and $X_5$ are fixed to a given category.

Example 9. In example 4, $\mathcal{V} = \mathcal{P}(124)$, $\mathcal{C} = \mathcal{P}(1234) \setminus \mathcal{P}(124)$; suppose we set
$\mathcal{I} = \{24, 124\}$ and $\mathcal{H} = \{234, 1234\}$ where $X_3$ is fixed to a given category; it can be easily
checked that $\mathcal{H}$ indexes the same number of parameters as $\mathcal{I}$.

3.3. Description of the algorithm

The problem, when reconstructing the distribution in $M$, is how to combine the mean
parameters $\mu_{\mathcal{I}}$, available from previous marginals, with the log-linear parameters $\eta_{\mathcal{I};M}$
defined again in the present marginal. Recall that the mixed parameterization requires
that mean parameters and log-linear interactions refer to two complementary sets
whose union is $\mathcal{P}(M)$. The idea is to remove from the log-linear parameters $\mathcal{C}$, to be
defined in $M$, the subset $\mathcal{H}$ with the same number of parameters as the elements of $\mathcal{I}$.
The algorithm that we describe below can handle such a context and the issue will be
to determine under which conditions such an algorithm may converge; if it does, then it
can be shown that the model is smooth. The algorithm for reconstructing the marginal
distribution in $M$ is made of two steps and requires starting values for $\eta_{\mathcal{H};M}$:

M-step given the latest guess for the log-linear parameters $\eta_{\mathcal{H};M}$, an updated estimate for
the vector of mean parameters $\mu_{\mathcal{H}}$ may be computed by a mixed parameterization
with mean parameters indexed by the collection of interactions $\mathcal{R} \cup \mathcal{I}$ and log-linear
parameters indexed by $\mathcal{C}$;

L-step given the latest guess for the vector of mean parameters $\mu_{\mathcal{H}}$, an updated
estimate for the vector of log-linear parameters $\eta_{\mathcal{H};M}$ may be computed by a mixed
parameterization with mean parameters indexed by $\mathcal{R} \cup \mathcal{H}$ and log-linear parameters
indexed by $\mathcal{I} \cup (\mathcal{C} \setminus \mathcal{H})$.
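
The two steps can be sketched in code for the setting of Example 9 with binary variables, where the previous marginal is 124 and the interactions 24 and 124 are redefined by giving up 234 and 1234. This is a minimal illustration under stated assumptions, not the author's implementation: each mixed parameterization is solved by a generic damped Newton iteration, variables 1 to 4 are stored as indices 0 to 3, and the helper names (`mixed_to_probs`, `lm_step`) are hypothetical.

```python
import itertools
import numpy as np

d = 4  # binary variables 1..4 stored as indices 0..3
SUBSETS = [s for k in range(1, d + 1) for s in itertools.combinations(range(d), k)]

def kron_all(factors):
    out = np.ones(1)
    for f in factors:
        out = np.kron(out, f)
    return out

G = np.column_stack([kron_all([np.array([0.0, 1.0]) if j in I else np.ones(2)
                               for j in range(d)]) for I in SUBSETS])
C = np.vstack([kron_all([np.array([-1.0, 1.0]) if j in I else np.array([1.0, 0.0])
                         for j in range(d)]) for I in SUBSETS])

def probs(eta):
    lin = G @ eta
    w = np.exp(lin - lin.max())
    return w / w.sum()

def idx(collection):
    return np.array([SUBSETS.index(I) for I in collection])

def mixed_to_probs(mean_idx, mu, eta_fixed, tol=1e-13, max_iter=200):
    """Damped Newton solver: mean parameters mu on mean_idx, canonical
    parameters taken from eta_fixed on every other interaction."""
    eta = eta_fixed.copy()
    eta[mean_idx] = 0.0
    GV = G[:, mean_idx]
    for _ in range(max_iter):
        p = probs(eta)
        resid = mu - GV.T @ p
        err = np.linalg.norm(resid)
        if err < tol:
            break
        step = np.linalg.solve(GV.T @ (np.diag(p) - np.outer(p, p)) @ GV, resid)
        t = 1.0
        while t > 1e-10:
            trial = eta.copy()
            trial[mean_idx] += t * step
            if np.linalg.norm(mu - GV.T @ probs(trial)) < err:
                eta = trial
                break
            t /= 2
        else:
            break
    return probs(eta)

I_idx = idx([(1, 3), (0, 1, 3)])                    # interactions 24 and 124
H_idx = idx([(1, 2, 3), (0, 1, 2, 3)])              # interactions 234 and 1234
V_idx = idx([s for s in SUBSETS if 2 not in s])     # all of P(124)
R_idx = np.setdiff1d(V_idx, I_idx)

def lm_step(eta_H, eta_target, mu):
    """One M-step followed by one L-step; eta_target carries the canonical
    parameters wanted on I and on C minus H, mu the mean parameters on V."""
    # M-step: means on R and I, canonical parameters on C (current guess on H)
    eta = eta_target.copy()
    eta[H_idx] = eta_H
    p = mixed_to_probs(V_idx, mu[V_idx], eta)
    mu_new = mu.copy()
    mu_new[H_idx] = G[:, H_idx].T @ p
    # L-step: means on R and H, canonical parameters on I and on C minus H
    mean_idx = np.concatenate([R_idx, H_idx])
    p = mixed_to_probs(mean_idx, mu_new[mean_idx], eta_target)
    return (C @ np.log(p))[H_idx]

rng = np.random.default_rng(0)
eta_true = 0.5 * rng.normal(size=len(SUBSETS))      # hypothetical parameters
eta_true[I_idx] = 0.0                               # constraint on 24 and 124 within M
p_true = probs(eta_true)
mu_all = G.T @ p_true
eta_H_next = lm_step(eta_true[H_idx], eta_true, mu_all)
```

Starting the iteration at the value of $\eta_{\mathcal{H};M}$ of a compatible distribution, one full step should return the same value, so `eta_H_next` reproduces `eta_true[H_idx]`; whether the iteration converges from other starting values is the question addressed in the rest of this section.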

In order to examine the properties of the LM algorithm, we need to determine how
changes in the input value of $\eta_{\mathcal{H};M}$ in the M-step affect the output value produced in the
L-step. For this purpose, we recall results concerning the derivatives of certain components
of the mixed parameterization relative to others which are relevant here. In the following
write $\pi$ as a shorthand for $p(M)$, let $D_\pi = \mathrm{diag}(\pi)$ and let $\Omega = D_\pi - \pi\pi'$ denote the
derivative of $\pi$ with respect to $\eta(M)'$.



Lemma 2.

$$F(M) = \frac{\partial \mu_{\mathcal{P}(M)}}{\partial \eta(M)'} = G(M)' \Omega(M) G(M)$$

is the covariance matrix of a collection of distinct binary variables determined by the
columns of $G(M)$ and thus is positive definite.

Proof. See Forcina [8]. □ 

Any two subsets of interactions $\mathcal{H}, \mathcal{K} \subseteq \mathcal{P}(M)$ determine two sub-collections of binary
random variables and a block in the covariance matrix $F(M)$. In the following we omit
reference to the marginal $M$ when it is obvious from the context and write

$$F_{\mathcal{HK}} = G_{\mathcal{H}}' \Omega G_{\mathcal{K}}.$$

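Lemma 3. In the M-step, where $\mathcal{H}$ is part of the log-linear parameters,

$$\frac{\partial \mu_{\mathcal{H}}}{\partial \eta_{\mathcal{H};M}'} = F_{\mathcal{HH}} - F_{\mathcal{HV}} F_{\mathcal{VV}}^{-1} F_{\mathcal{VH}};$$

in the L-step, where $\mathcal{H}$ is part of the mean parameters,

$$\frac{\partial \eta_{\mathcal{H};M}}{\partial \mu_{\mathcal{H}}'} = \left(F_{\mathcal{HH}} - F_{\mathcal{HR}} F_{\mathcal{RR}}^{-1} F_{\mathcal{RH}}\right)^{-1},$$

where we have used the formula for the inverse of a partitioned matrix.

Proof. The result follows from Lemma 4 in Forcina [8]. □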

A full step of the LM algorithm may be seen as a fixed point function which, given
a guess value of $\eta_{\mathcal{H};M}$, produces an updated estimate of the same vector. A sufficient
condition for an algorithm to be a contraction [see for example 1], a property which
implies that it converges to a unique solution, is that the Jacobian of a full LM step has
spectral radius (maximum absolute eigenvalue) strictly smaller than 1. Let $J$ be the
Jacobian of this mapping, that is, by the chain rule, the product of the L-step and M-step
derivatives given in Lemma 3; let also $Q_{\mathcal{IH}|\mathcal{R}} = F_{\mathcal{IH}} - F_{\mathcal{IR}} F_{\mathcal{RR}}^{-1} F_{\mathcal{RH}}$. An upper bound
for the spectral radius of $J$ is determined in the following lemma.

Lemma 4. The spectral radius of $J$ is always less than 1 except when $Q_{\mathcal{IH}|\mathcal{R}}$ is not of
full rank.

Proof. See the Appendix. □

The main result of this section is contained in the following Theorem and concerns the
properties of the mapping from $\xi = (\eta_{(\mathcal{C} \cup \mathcal{I}) \setminus \mathcal{H};M}, \mu_{\mathcal{V}})$ to $\pi$ under the assumption that the
elements of $\xi$ are compatible, that is, there is at least a $\pi$ with the parameters specified
by $\xi$. The result depends on the spectral radius of the Jacobian matrix $J$ defined above.

Theorem 1. Under the assumption that the elements of $\xi$ are compatible, when $Q_{\mathcal{IH}|\mathcal{R}}$
is of full rank, the mapping from $(\eta_{(\mathcal{C} \cup \mathcal{I}) \setminus \mathcal{H};M}, \mu_{\mathcal{V}})$ to $\pi$ is one to one and smooth. In the
special case when $Q_{\mathcal{IH}|\mathcal{R}} = 0$, $J$ is an identity matrix and the mapping is not one
to one. When $Q_{\mathcal{IH}|\mathcal{R}}$ is singular but different from a null matrix, the algorithm does not
converge and nothing can be said about the smoothness of the mapping.

Proof. Consider the sequence of vectors produced by the LM algorithm: $\eta^{(0)}_{\mathcal{H};M}, \eta^{(1)}_{\mathcal{H};M}, \ldots$,
where $\eta^{(0)}_{\mathcal{H};M}$ is the starting value and $\eta^{(s+1)}_{\mathcal{H};M}$ is the output of one step of the LM algorithm
when we use $\eta^{(s)}_{\mathcal{H};M}$ as input; because we have assumed that there is at least a compatible
solution inside the parameter space, [1, Theorem 1.1] implies that, if the spectral radius of
$J$ is strictly less than one, the sequence converges to a unique solution. At convergence the
argument in [3, Theorem 1] can be applied to show that the mapping is a diffeomorphism.
In the special case when the Jacobian matrix $J$ is an identity matrix, $\eta^{(1)}_{\mathcal{H};M} = \eta^{(0)}_{\mathcal{H};M}$, so
the algorithm converges in one step, irrespective of the starting value. This implies that,
if $\pi^{(0)}$ is the probability vector corresponding to $\eta^{(0)}_{\mathcal{H};M}$, there is a whole neighbourhood of
$\pi^{(0)}$ whose points share exactly the same vector $\xi$ of mean and log-linear parameters. □

Remark 1. According to Theorem 1, a model may be smooth even if the log-linear inter- 
actions in X are defined and constrained in two different marginals. This is apparently in 
conflict with the result of [4, Theorem 3] which says that the Jacobian obtained by
differentiating the same log-linear interaction $I$ defined in two different marginals, say $M_1$, $M_2$,
with respect to $p$, is singular for the uniform distribution, a condition which is necessary
(but not sufficient) for a model to have singularities. However, when the set M\I is 
not empty, the log-linear interactions defined by Bergsma and Rudas [4] are constructed 
by averaging conditional interactions across all possible configurations of the conditioning 
variables. As mentioned in section 2.1, the results of Lemma 8 in the Appendix imply that 
any constraint on one of their log-linear interactions is equivalent to a linear constraint 
on the whole ascending class of our interactions with minimal element I and maximal 
element M . Hence the LM algorithm is not directly applicable to interactions defined in 
that way. 

3.4. Convergence of the algorithm 

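Below we derive a more convenient expression for $Q_{\mathcal{IH}|\mathcal{R}}$ and show that the matrix is
non singular under complete independence, if the set $\mathcal{H}$ satisfies certain conditions. We
also determine conditions under which $Q_{\mathcal{IH}|\mathcal{R}}$ is singular or null. Finally, we discuss the
singularity of the same matrix when $p(M)$ is unrestricted.

Let $P_0 = 1\pi'$ be the projector, according to the metric defined by the matrix $D_\pi$, on
the space spanned by the vector $1$. By simple algebra, it can be shown that:

$$F_{\mathcal{IH}} = G_{\mathcal{I}}' \Omega G_{\mathcal{H}} = G_{\mathcal{I}}' (I - P_0)' D_\pi (I - P_0) G_{\mathcal{H}}.$$

From the previous result, it follows that:

$$Q_{\mathcal{IH}|\mathcal{R}} = G_{\mathcal{I}}' D_\pi (I - P_{\mathcal{R}})(I - P_0) G_{\mathcal{H}},$$

where

$$P_{\mathcal{R}} = (I - P_0) G_{\mathcal{R}} F_{\mathcal{RR}}^{-1} G_{\mathcal{R}}' (I - P_0)' D_\pi$$

is the projector, according to the metric defined by the matrix $D_\pi$, on the space spanned
by the columns of $(I - P_0) G_{\mathcal{R}}$.

Let $S(X)$ denote the space spanned by the columns of $X$; for every $a \subseteq M$ let
also $X_a = \bigotimes_{j \in M} X_j$, where $X_j = I_{r_j+1}$ if $j \in a$ and $X_j = 1_{r_j+1}$ otherwise, and let
$P_a = X_a (X_a' D_\pi X_a)^{-1} X_a' D_\pi$ be the projection matrix onto $S(X_a)$. Let $X_{\mathcal{I} \cup \mathcal{R}}$ be the matrix
made by the columns of $X_a$, $a \in \mathcal{I} \cup \mathcal{R}$, and $P_{\mathcal{I} \cup \mathcal{R}}$ the projection onto $S(X_{\mathcal{I} \cup \mathcal{R}})$. Because
$S(G_{\mathcal{R}})$ and $S(1)$ belong to $S(X_{\mathcal{I} \cup \mathcal{R}})$, the projection matrix $P_{\mathcal{I} \cup \mathcal{R}}$ commutes with both
$P_{\mathcal{R}}$ and $P_0$; in addition, by using the identity $P_{\mathcal{I} \cup \mathcal{R}} G_{\mathcal{I}} = G_{\mathcal{I}}$ it follows that

$$Q_{\mathcal{IH}|\mathcal{R}} = G_{\mathcal{I}}' D_\pi (I - P_{\mathcal{R}})(I - P_0) P_{\mathcal{I} \cup \mathcal{R}} G_{\mathcal{H}}. \tag{6}$$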
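
The equivalence between the block definition of $Q_{\mathcal{IH}|\mathcal{R}}$ and the two projector expressions can be verified numerically. The sketch below does so for the binary version of Example 9 with an arbitrary (randomly generated) distribution; indices 0 to 3 stand for variables 1 to 4, and since 124 is the only maximal set of $\mathcal{I} \cup \mathcal{R}$ in that example, $S(X_{\mathcal{I} \cup \mathcal{R}})$ is taken to be $S(X_{124})$. All names are illustrative.

```python
import numpy as np

d = 4

def kron_all(factors):
    out = np.ones((1, 1))
    for f in factors:
        out = np.kron(out, f)
    return out

def G_of(collection):
    # one column per interaction: indicator that every variable in it equals 1
    cols = [kron_all([np.array([[0.0], [1.0]]) if j in I else np.ones((2, 1))
                      for j in range(d)]) for I in collection]
    return np.hstack(cols)

I_set = [(1, 3), (0, 1, 3)]                        # 24, 124
H_set = [(1, 2, 3), (0, 1, 2, 3)]                  # 234, 1234
R_set = [(0,), (1,), (3,), (0, 1), (0, 3)]         # P(124) without I
GI, GH, GR = G_of(I_set), G_of(H_set), G_of(R_set)

rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(2 ** d))                # hypothetical distribution
D = np.diag(pi)
Omega = D - np.outer(pi, pi)
Id = np.eye(2 ** d)

def F(A, B):
    return A.T @ Omega @ B

# block definition
Q_blocks = F(GI, GH) - F(GI, GR) @ np.linalg.solve(F(GR, GR), F(GR, GH))

# projector form
P0 = np.outer(np.ones(2 ** d), pi)
PR = (Id - P0) @ GR @ np.linalg.solve(F(GR, GR), GR.T @ (Id - P0).T @ D)
Q_proj = GI.T @ D @ (Id - PR) @ (Id - P0) @ GH

# equation (6), with the projector onto functions of variables 1, 2 and 4
X = kron_all([np.eye(2) if j in (0, 1, 3) else np.ones((2, 1)) for j in range(d)])
P_IR = X @ np.linalg.solve(X.T @ D @ X, X.T @ D)
Q_eq6 = GI.T @ D @ (Id - PR) @ (Id - P0) @ P_IR @ GH
```

The three matrices `Q_blocks`, `Q_proj` and `Q_eq6` agree up to rounding, and `PR` is idempotent as expected of a projector.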
3.4.1. The case of complete independence

Lemma 5. Under complete independence of the variables in $M$, (i) if $\mathcal{H}$ contains an
interaction $v \notin \mathcal{A}$, the corresponding columns in $Q_{\mathcal{IH}|\mathcal{R}}$ are null, (ii) if $\mathcal{H}$ contains an
interaction $v$ where at least one of the variables in $v$ is not binary and not contained in
any element of $\mathcal{V}$, $Q_{\mathcal{IH}|\mathcal{R}}$ has a block of columns which is not of full rank.

Proof. See the Appendix. □

Lemma 5 suggests two necessary conditions for $Q_{\mathcal{IH}|\mathcal{R}}$ to be non singular: $\mathcal{H}$ cannot
contain interactions not in $\mathcal{A}$ and all the non binary variables involved in the class of
interactions $\mathcal{H}$ and not present in the class $\mathcal{I} \cup \mathcal{R}$ must be fixed to a single category
different from the reference category. This implies that only a limited number of higher
order interactions in $M$ can be used as a replacement for those in $\mathcal{I}$. The definition
below provides a set of conditions for $\mathcal{H}$ which will be shown to be sufficient. Let
$\mathcal{K} = \{m_1, \ldots, m_r\}$ be the family of the maximal sets of $\mathcal{I} \cup \mathcal{R}$; for $t \in \mathcal{I}$, let
$\mathcal{K}(t) = \{m : m \in \mathcal{K}, t \subseteq m\}$ be the family of sets $m \in \mathcal{K}$ that contain $t$, $\mathcal{K}(t, h)$ be the family
of the sets $g$, $g \in \mathcal{P}[\mathcal{K}(t)]$, such that $h \cap \bigcap_{m_j \in g} m_j = \emptyset$, and $\bar{\mathcal{K}}(t, h) = \mathcal{P}(\mathcal{K}) \setminus \mathcal{K}(t, h)$.

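Definition 3. A set $\mathcal{H}$ is a valid replacement for a given $\mathcal{I}$ if it satisfies the following
conditions:

(i) there is a one to one correspondence between the elements of $\mathcal{I}$ and $\mathcal{H}$ such that, for
each $t \in \mathcal{I}$, there is a $v = t \cup h \in \mathcal{H}$, $t \cap h = \emptyset$, where the variables in $h$ are fixed to
a given category different from the reference category;

(ii) for every $t \cup h \in \mathcal{H}$, $\sum_{g \in \mathcal{K}(t, h)} (-1)^{|g|+1} \neq 0$;

(iii) there exists a complete ordering "$\prec$" in $\mathcal{I}$, coherent with the partial ordering of set
inclusion, such that, for every $t \cup h \in \mathcal{H}$, $g \in \bar{\mathcal{K}}(t, h)$ and $s = \left(\bigcap_{m_j \in g} m_j\right) \cap (t \cup h)$,
either $s \in \mathcal{R}$, or $s \in \mathcal{I}$ and $s \prec t$.

To clarify these notions, we discuss a few examples where we write $(t, h)$ as a shorthand
for $t \cup h$ if $t \cup h \in \mathcal{H}$.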
Example 10. In example 3 with $\mathcal{I} = \{12, 123\}$ and $\mathcal{H} = \{(12, 4), (123, 4)\}$, all conditions
are trivially satisfied with $\mathcal{K} = \{123\}$; this is also the only element of $\mathcal{K}(t, h)$, $\forall t, h$;
the same happens in example 4. In example 5 with the ordered set $\mathcal{I} = \{12, 13, 123, 134\}$
and $\mathcal{H} = \{(12, 5), (13, 5), (123, 5), (134, 5)\}$ condition (i) is clearly satisfied. In this case we
have: $\mathcal{K} = \{123, 134\}$, $\mathcal{K}(134, 5) = \{134\}$, $\mathcal{K}(123, 5) = \{123\}$, $\mathcal{K}(13, 5) = \{123, 134, \{123, 134\}\}$
and $\mathcal{K}(12, 5) = \{123\}$ and condition (ii) is satisfied by these sets. For
$\bar{\mathcal{K}}(134, 5) = \{123, \{123, 134\}\}$ condition (iii) is satisfied with $s = 13$. The set
$\bar{\mathcal{K}}(123, 5) = \{134, \{123, 134\}\}$ satisfies (iii) because $s = 13$. In the case of
$\bar{\mathcal{K}}(12, 5) = \{134, \{123, 134\}\}$ (iii) holds because $s = 1$. The family $\bar{\mathcal{K}}(13, 5)$ is empty and so
in this case condition (iii) is void.

Example 11. Having assumed $1 \perp\!\!\!\perp 2 \mid (3, 4)$, $1 \perp\!\!\!\perp 2 \mid (3, 5)$, $1 \perp\!\!\!\perp 2 \mid (3, 6)$, we also want
$1 \perp\!\!\!\perp 2 \mid (4, 5, 6)$. It can be verified that $\mathcal{H} = \{(12, 56), (124, 56)\}$ is a valid replacement
for $\mathcal{I} = \{12, 124\}$; here $\mathcal{K} = \{124, 125, 126\}$ and $\mathcal{K}(124, 56) = \{124\}$. It is easy to
see that (i) and (ii) are satisfied. Condition (iii) holds in the case of
$\bar{\mathcal{K}}(124, 56) = \{125, 126, \{124, 125\}, \{124, 126\}, \{125, 126\}, \{124, 125, 126\}\}$, because apart from the
first two elements which produce sets $s$ that belong to $\mathcal{R}$, all the others produce $s = 12$. A
similar remark holds in the case of $\bar{\mathcal{K}}(12, 56)$. However, if we set $\mathcal{I} = \{12, 124, 125, 126\}$,
though $\mathcal{H} = \{(12, 56), (124, 56), (125, 4), (126, 4)\}$ satisfies conditions (i) and (ii), (iii)
does not hold.

We now give an instance where condition (ii) is not satisfied. 

Example 12. Suppose that $1 \perp\!\!\!\perp 2 \mid (3, 4)$, $1 \perp\!\!\!\perp 2 \mid (3, 5)$ and finally $1 \perp\!\!\!\perp 2 \mid (3, 4, 5, 6)$.
Though the best choice would be to set $\mathcal{I} = \{12, 123, 124, 125, 1234, 1235\}$, if we set
$\mathcal{I} = \{12, 123\}$ and $\mathcal{H} = \{(12, 46), (123, 46)\}$, condition (ii) is not satisfied.

Lemma 6. A pair $\mathcal{I}$, $\mathcal{H}$, where $\mathcal{H}$ is a valid replacement, always exists; for instance, take
$\mathcal{I} = \{t\}$, where $t$ is one of the minimal elements of $\mathcal{A}$, and $\mathcal{H} = \{t \cup h\}$, where $h$ is formed by
all the variables that belong to at most one element of $\mathcal{K}(t)$ when $\mathcal{K}(t)$ is not a singleton
and by the variables that do not belong to the unique element of $\mathcal{K}(t)$ otherwise.

Proof. See the Appendix □ 

Example 13. In example 5, the minimal element $t$ of $\mathcal{A}$ can be 12 or 13. If $t = 12$
then $\mathcal{K}(t) = \{123\}$ is a singleton. In this case $h = 45$ and $\mathcal{H}$ contains only $(12, 45)$; the
family $\mathcal{K}(t, h)$ contains only $\{123\}$. Instead, if we set $t = 13$, $\mathcal{K}(t) = \{123, 134\}$ and
we must set $h = 45$, thus $\mathcal{K}(t, h)$ contains only the set $g = \{123, 134\}$. In example 12,
$\mathcal{K} = \{1234, 1235\}$, with $t = 12$ we must set $h = 456$ and $\{1234, 1235\}$ is the only set in
$\mathcal{K}(t, h)$. In example 11 with $t = 12$, $\mathcal{K}(t) = \{1234, 1235, 1236\}$, thus we must set $h = 456$;
here $\mathcal{K}(t, h)$ has 3 elements of size 2 and 1 element of size 3 and the sum in condition (ii)
of Definition 3 is $-2$.
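
The families $\mathcal{K}(t)$ and $\mathcal{K}(t, h)$ are easy to enumerate by machine. The sketch below does so and computes the alternating sum $\sum_{g \in \mathcal{K}(t,h)} (-1)^{|g|+1}$, which is assumed here to be "the sum in condition (ii)" referred to in the examples; this reading is inferred from the worked values in the text, and the function names are hypothetical.

```python
from itertools import combinations

def S(label):
    """Turn a label such as "1234" into the set of variables {1, 2, 3, 4}."""
    return frozenset(int(c) for c in label)

def K_t(K, t):
    """Maximal sets in K that contain t."""
    return [m for m in K if t <= m]

def K_th(K, t, h):
    """Non-empty sub-families g of K(t) whose common intersection is disjoint from h."""
    members = K_t(K, t)
    out = []
    for k in range(1, len(members) + 1):
        for g in combinations(members, k):
            common = frozenset.intersection(*g)
            if not (h & common):
                out.append(frozenset(g))
    return out

def alternating_sum(K, t, h):
    """Sum of (-1)^(|g|+1) over g in K(t, h)."""
    return sum((-1) ** (len(g) + 1) for g in K_th(K, t, h))

# Example 10 (from example 5): K = {123, 134}, t = 13, h = 5
K5 = [S("123"), S("134")]
fam_13_5 = K_th(K5, S("13"), S("5"))

# last case of Example 13: K(t) = {1234, 1235, 1236}, t = 12, h = 456
K3 = [S("1234"), S("1235"), S("1236")]
fam_12_456 = K_th(K3, S("12"), S("456"))
sum_12_456 = alternating_sum(K3, S("12"), S("456"))

# Example 12: K = {1234, 1235}, t = 12, h = 46
sum_ex12 = alternating_sum([S("1234"), S("1235")], S("12"), S("46"))
```

Under this reading, the last case of Example 13 gives three families of size 2 and one of size 3, while the choice of Example 12 gives an alternating sum equal to zero, which matches the statement that condition (ii) fails there.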

Let $G_{t,h}(j_h)$ be the sub-matrix of $G_{t \cup h}$ where variables in $h$ are fixed to $j_h$.

Lemma 7. Under complete independence:

a) $S(P_a G_{t,h}(j_h)) \subseteq S(G_r)$, where $r = a \cap (t \cup h)$,

b) if $t \subseteq a$ and $a \cap h = \emptyset$, $P_a G_{t,h}(j_h) = G_t P(x_h = j_h)$.

Proof. See the Appendix. □

Theorem 2. Ifn is a, valid replaceuient for X, under complete independence, QxhItz 
non singular. 

Proof. Because under complete independence the projectors Pa-, a (1 M commute, we 
can write Pxun = Ege-p{/c)(-l)^"*''^' Hmeg ^rn, it follows that: 

g&K{t,h) meg g£K.{t,h) meg 

Condition (ii) of Definition 3 and b) in Lemma 7 imply that the first sum is $k G_t$, with $k \neq 0$.
Condition (iii) of Definition 3 and a) in Lemma 7 imply that, when an element in the second
sum, say U, is such that $S(U) \subseteq S(G_{\mathcal{R}})$, left multiplication by $(I - P_{\mathcal{R}})(I - P_{\emptyset})$
gives a null matrix, because $(I - P_{\mathcal{R}})$ projects onto the space orthogonal to $(I - P_{\emptyset}) G_{\mathcal{R}}$.
In all other cases there exists a non null matrix $A_{s,t}$, $s \preceq t$, such that:

$$(I - P_{\mathcal{R}}) (-1)^{|g|+1} \prod_{m \in g} P_m G_{t,h}(j_h) = (I - P_{\mathcal{R}}) G_s A_{s,t}.$$

The matrix $G_{\mathcal{H}}$ is made of blocks of columns of the form $G_{t,h}(j_h)$; we may assume,
without loss of generality, that these blocks are in the same order as the elements of $\mathcal{I}$
specified in condition (iii). It then follows that

$$Q_{\mathcal{I}\mathcal{H}|\mathcal{R}} = Q_{\mathcal{I}\mathcal{I}|\mathcal{R}} A,$$

where the matrix A has blocks $A_{s,t}$, $s, t \in \mathcal{I}$, such that $A_{s,t} = 0$ if $t \prec s$ and, because
of condition (ii) of Definition 3, the diagonal blocks $A_{t,t}$ are proportional to an identity
matrix; thus A is lower triangular and non singular. The result follows because both
matrices in the product above are non singular. □
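
The inclusion-exclusion expansion used at the start of the proof is a purely algebraic property of commuting projectors. The sketch below checks it on three hypothetical projectors built as Kronecker products; these are stand-ins chosen only because they commute, not the projectors $P_a$ of the paper, and all function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inclusion_exclusion(projectors):
    """Sum over non-empty subsets g of (-1)^(|g|+1) times the product of g."""
    n = len(projectors[0])
    total = [[Fraction(0)] * n for _ in range(n)]
    for size in range(1, len(projectors) + 1):
        sign = (-1) ** (size + 1)
        for g in combinations(projectors, size):
            prod = identity(n)
            for p in g:
                prod = matmul(prod, p)
            total = [[t + sign * q for t, q in zip(tr, qr)]
                     for tr, qr in zip(total, prod)]
    return total

def via_complements(projectors):
    """I minus the product of the complements (I - P_m)."""
    n = len(projectors[0])
    prod = identity(n)
    for p in projectors:
        prod = matmul(prod, subtract(identity(n), p))
    return subtract(identity(n), prod)

# Hypothetical commuting projectors on a 2 x 2 x 2 table.
J = [[Fraction(1, 2)] * 2 for _ in range(2)]   # projector onto constants
I2 = identity(2)
P1 = kron(kron(J, I2), I2)
P2 = kron(kron(I2, J), I2)
P3 = kron(kron(I2, I2), J)

P = inclusion_exclusion([P1, P2, P3])
print(P == via_complements([P1, P2, P3]), matmul(P, P) == P)
```

The resulting matrix is idempotent and leaves each of the three projectors unchanged under left multiplication, which is what one expects of the projector onto the sum of their ranges.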

Example 14. The choice of $\mathcal{I}$, $\mathcal{H}$ in the second part of Example 11 does not satisfy (ii);
still, numerical simulations indicate that $Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$ is non singular. This shows that the
conditions for being an adequate replacement are only sufficient.

Remark 2. Theorem 6 of Roverato et al. [12] implies that, in the binary case, the model
defined by $a \perp\!\!\!\perp b$ and $a \perp\!\!\!\perp b \mid c$, with all the elements of c equal to 0, is smooth. If we set
$\mathcal{I} = \mathcal{P}(a \cup b) \setminus (\mathcal{P}(a) \cup \mathcal{P}(b))$ and $\mathcal{H} = \{(t, h) : t \in \mathcal{I},\ h = c\}$ with $j_h = 1$, our Theorem
2 implies that, under independence, the model $a \perp\!\!\!\perp b$ and $a \perp\!\!\!\perp b \mid c$, except when all the
elements of c are equal to 1, is smooth.
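
As a small illustration of the set of interactions used in the remark, the sketch below enumerates the subsets of $a \cup b$ that are contained neither in a nor in b. It assumes that the operator applied to $a \cup b$, a and b is the power set; the helper names and the example sets are hypothetical.

```python
from itertools import combinations

def powerset(variables):
    """All subsets of `variables`, including the empty set, as frozensets."""
    items = sorted(variables)
    return {frozenset(c) for size in range(len(items) + 1)
            for c in combinations(items, size)}

def crossing_interactions(a, b):
    """Subsets of a | b that are contained neither in a nor in b."""
    return powerset(a | b) - (powerset(a) | powerset(b))

# Hypothetical example: a = {1}, b = {2, 3}.
a, b = {1}, {2, 3}
interactions = crossing_interactions(a, b)
labels = sorted("".join(str(v) for v in sorted(t)) for t in interactions)
print(labels)
```

Every set returned contains at least one variable of a and at least one of b, so these are exactly the interactions that the independence of a and b constrains.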

3.4.2. The general case

Unfortunately, the main arguments used above depend crucially on the assumption
of complete independence. For discrete data, all the models which have been shown
to be non smooth have a singular locus which is a subset of that defined by complete
independence; for Gaussian models, instead, [6, Example 4.4] indicates that a non smooth
model may have a singular locus not contained in the model of complete independence.
It is also interesting to note that numerical evaluations of $Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$ outside the space of
complete independence indicate that it is of full rank even if $\mathcal{H}$ does not satisfy the
conditions of Definition 3.

Taking into account all of the above, the extensive simulations which we have performed
seem to support the conjecture that all the models obtained by replacing the
interactions in $\mathcal{I}$ with an adequate replacement in $\mathcal{H}$ are indeed smooth everywhere in
the parameter space. Though it seems very unlikely, we cannot rule out the possibility that
these models may be singular at points of the parameter space which lie outside the
subspace defined by the model of complete independence. However, the expression for
$Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$ is very easy to compute, and the software in MATLAB and R which we provide as
supplementary material may be used for quick numerical checks.
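
A quick numerical check of this kind amounts to testing whether a computed matrix has full rank. The sketch below is a generic pure-Python rank test by Gaussian elimination; it is not the supplementary MATLAB or R software, the function names and tolerance are assumptions, and the test matrix is a hypothetical lower triangular matrix with constant diagonal rather than an actual $Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$.

```python
def numerical_rank(matrix, tol=1e-10):
    """Rank of a matrix (list of rows) by elimination with partial pivoting."""
    rows = [[float(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank

def is_nonsingular(matrix, tol=1e-10):
    """True when the matrix is square and of full numerical rank."""
    return len(matrix) == len(matrix[0]) and numerical_rank(matrix, tol) == len(matrix)

# Hypothetical matrices: a lower triangular one and a rank-deficient one.
lower_triangular = [[2, 0, 0], [1, 2, 0], [3, 1, 2]]
rank_deficient = [[1, 2], [2, 4]]
print(is_nonsingular(lower_triangular), is_nonsingular(rank_deficient))
```

An absolute tolerance is a crude choice; for badly scaled matrices a test based on singular values would be more reliable.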

4. Context specific conditional independence models 

We have seen that a non smooth model may be transformed into a smooth one by
omitting a class of log-linear interactions $\mathcal{H}$ to make room for other interactions, defined in
a previous marginal, which we want to redefine in M. This implies that the values of the
interactions in $\mathcal{H}$ are uniquely determined by the probabilities reconstructed in previous
marginals and by the log-linear interactions defined in M; thus they cannot be constrained.
Because $\mathcal{H} \subseteq \mathcal{A}$, certain constraints implied by the original conditional independence in
M cannot be implemented; in addition, when $\mathcal{A} \cap \mathcal{R} \neq \emptyset$, further limitations must be
taken into account. In this section we study the nature and scope of the actual constraints
that can be imposed in M. Recall that a conditional independence statement that
holds only on a subset of the configurations of the conditioning variables is a context
specific conditional independence (Højsgaard [10]).
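
To make the notion concrete, the sketch below builds a hypothetical joint distribution of three binary variables in which $X_1$ and $X_2$ are independent in the context $X_3 = 0$ but not in the context $X_3 = 1$. For binary variables the conditional association is measured here by the log odds ratio within each context; the numbers are invented for illustration only.

```python
from math import log

def log_odds_ratio(table):
    """Log odds ratio of a 2x2 table [[p00, p01], [p10, p11]]."""
    return log(table[0][0] * table[1][1] / (table[0][1] * table[1][0]))

# Hypothetical joint probabilities of (X1, X2), one 2x2 table per value of X3.
joint = {
    0: [[0.06, 0.14], [0.09, 0.21]],  # proportional to (0.4, 0.6) x (0.3, 0.7)
    1: [[0.20, 0.05], [0.05, 0.20]],  # X1 and X2 strongly associated
}

# Association between X1 and X2 within each context of X3.
interaction = {x3: log_odds_ratio(table) for x3, table in joint.items()}
total_mass = sum(p for table in joint.values() for row in table for p in row)

for x3, value in interaction.items():
    print(f"X3 = {x3}: conditional log odds ratio = {value:.4f}")
```

The independence of $X_1$ and $X_2$ given $X_3$ holds only in the context $X_3 = 0$, which is the kind of restricted statement studied in this section.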

The collection of interactions $\mathcal{J} = \mathcal{H} \cup (\mathcal{A} \cap \mathcal{R})$ belongs to $\mathcal{A}$ but cannot be constrained
in M; these interactions are of two kinds: (i) those which belong to $\mathcal{H}$ and have to be omitted
as a replacement for duplicating those in $\mathcal{I}$, and (ii) those in $\mathcal{R}$ which we were unable to
replicate because, if they were included in $\mathcal{I}$, no adequate replacement $\mathcal{H}$ would exist.
It follows that the conditional independence in M must be restricted to the contexts that
do not require constraining the collection of interactions $\mathcal{J}$. More precisely, point (i)
implies that the conditional independence can be defined only in the contexts in which,
for every $(t, h) \in \mathcal{H}$, the variables in h are different from $j_h$. Point (ii) implies that the
conditional independence can be defined only in the contexts where, for every maximal
set m of $\mathcal{I}$ and every $v \in \mathcal{A} \cap \mathcal{R}$, the variables belonging to $v \setminus m$ are fixed to the reference
category. Let us examine some of the previous examples to clarify the situation.

Example 15. Consider again Example 5; here $\mathcal{A} \cap \mathcal{R}$ is empty and

$\mathcal{H} = \{(12, 5), (123, 5), (13, 5), (134, 5)\}$;

it follows that we are left with $1 \perp\!\!\!\perp (2, 3) \mid (4, 5)$ for all $x_5 \neq j_5$. In Example 6, because $\mathcal{J}$ =
{124, 125, 1234, 1235, (12, 45), (123, 45)}, with $\mathcal{I} = \{12, 123\}$, we can have $1 \perp\!\!\!\perp 2 \mid (3, 4, 5)$
where $x_4, x_5$ are fixed to the reference category, a statement of much more limited scope
than the original one. Finally, in Example 11, though $\mathcal{J} = \{125, 126\}$ is smaller than
before, with $\mathcal{I} = \{12, 124\}$, we can only impose $1 \perp\!\!\!\perp 2 \mid (4, 5, 6)$ with $x_5, x_6$ fixed to the
reference category.

As mentioned above, the conditional independence in M would not be restricted if
the elements of $\mathcal{H}$ did not belong to $\mathcal{A}$; however, Theorem 1 implies that the resulting
model is non smooth.

Example 16. In Example 3 suppose that all variables have the same number of categories;
because $\mathcal{A} = \{12, 123, 124, 1234\}$, the conditional independence is not affected if we take
$\mathcal{H} = \{23, 234\}$; though this corresponds to the same number of parameters as $\mathcal{I}$, the Jacobian
J of the LM algorithm is the identity matrix and the model is non smooth.
Appendix

Interactions defined as contrasts of averages of logarithms of probabilities. 

An alternative to the reference category interactions $\eta_{I;M}(x_I)$ are the interactions
based on contrasts of averages, which may be defined as

$$\tilde\eta_{I;M}(x_I \mid x_{M \setminus I}) = \sum_{b \subseteq I} (-1)^{|I \setminus b|} \frac{1}{(I \setminus b)} \sum_{x_{I \setminus b}} \log p(x_b, x_{I \setminus b}, x_{M \setminus I}), \qquad (7)$$

where $(I)$ denotes the number of possible configurations of the vector $x_I$.

When $M \setminus I$ is not empty, the interactions defined in (7) depend on the value of the
remaining variables and may be interpreted as the log-linear interaction I in the marginal
distribution M conditionally on $X_h = x_h$, $h \in M \setminus I$. To use the interactions defined in (7)
as a parameterization, the usual way of removing redundancies is to average with respect
to the conditioning variables, leading to the following expression

$$\bar\eta_{I;M}(x_I) = \frac{1}{(M \setminus I)} \sum_{x_{M \setminus I}} \tilde\eta_{I;M}(x_I \mid x_{M \setminus I}).$$

It is well known that both the contrasts of averages interactions $\bar\eta_{a;M}(x_a)$, used by
Bergsma and Rudas [4], and the reference category interactions $\eta_{a;M}(x_a) = \eta_{a;M}(x_a \mid 0_{M \setminus a})$,
used in this paper, are a parametrization of the joint probabilities.

Let $\tilde\eta(I, M; 0_{M \setminus I})$ denote the vector of log-linear interactions in (7) when the variables
in $x_I$ take all possible configurations in lexicographic order. It is easy to verify that we
may write $\tilde\eta(I, M; 0_{M \setminus I}) = S(I, M) \log p(M)$, where $S(I, M) = \bigotimes_{i \in M} S_i$ and $S_i$ is equal
to the matrix $I - 11'/(r_i + 1)$ without the first row if $i \in I$ and to the vector $(1, 0')$
otherwise. It is also easy to verify that the vector of log-linear interactions $\bar\eta(I, M)$,
obtained by averaging over all possible configurations of the conditioning variables, may
be written as $\bar\eta(I, M) = \bar S(I, M) \log p(M)$, where $\bar S(I, M) = \bigotimes_{i \in M} \bar S_i$ and $\bar S_i$ is equal
to $S_i$ when $i \in I$ and to the vector $1'/(r_i + 1)$ otherwise.

Lemma 8.

$$\tilde\eta(I, M; 0_{M \setminus I}) = A\, \eta(I, M) \qquad (8)$$
$$\bar\eta(I, M) = B\, \eta(\mathcal{I}, M) \qquad (9)$$

for suitable matrices of constants A and B.

Proof. By substitution in (5), $\tilde\eta(I, M; 0_{M \setminus I}) = S(I, M) G(M) \eta(M)$, and (8) follows by
noting that the Kronecker product contains a 0 factor if there is an $i \in I$, $i \notin J$, because
$S_i$ is a matrix of row contrasts and $G_i$ is the unitary vector; the same result arises if there
is an $i \notin I$, $i \in J$, because $S_i$ is the vector $(1, 0')$ and $G_i$ is a matrix whose first
row is a row of 0's. Equation (9) follows by a similar argument: when $i \in I$, $i \notin J$, $\bar S_i = S_i$
and we get a 0 factor as above; instead, when $i \notin I$, $i \in J$, $\bar S_i$ is proportional to the
unitary vector, so that $\bar S_i G_i$ is also proportional to a unitary vector. □
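
The two vanishing factors invoked in the proof can be checked directly for a single variable. The sketch below assumes that the unitary vector is a vector of ones and that the matrix with a null first row is an identity matrix without its first column; the function names and the choice of three categories are illustrative.

```python
from fractions import Fraction

def contrast_matrix(n_cat):
    """The matrix I - 11'/n_cat with its first row removed."""
    full = [[Fraction(int(i == j)) - Fraction(1, n_cat) for j in range(n_cat)]
            for i in range(n_cat)]
    return full[1:]

def identity_without_first_column(n_cat):
    """An n_cat x (n_cat - 1) matrix whose first row is a row of zeros."""
    return [[int(i == j) for j in range(1, n_cat)] for i in range(n_cat)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

n_cat = 3                                   # r_i + 1 categories
S = contrast_matrix(n_cat)
ones = [[1] for _ in range(n_cat)]          # vector of ones, as a column
G = identity_without_first_column(n_cat)
reference_row = [[1] + [0] * (n_cat - 1)]   # the vector (1, 0')

print(matmul(S, ones))            # rows of contrasts annihilate the ones vector
print(matmul(reference_row, G))   # (1, 0') annihilates a matrix with null first row
print(matmul(S, G))               # this product does not vanish
```

Only the product of the contrast matrix with the identity without its first column survives, which matches the case $i \in I$, $i \in J$.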

By noting that any category may be chosen as reference category for each variable,
(8) implies that any log-linear interaction I defined in (7) is a linear function of the
log-linear interactions I defined in (1) for all possible values of $x_I$. Instead, the log-linear
interactions $\bar\eta(I, M)$, obtained by averaging across the conditioning variables, are a linear
function of all $\eta(J, M)$, $I \subseteq J \subseteq M$.

Proofs of the Lemmas 

Proof of Lemma 4. Note that A is the residual variance in a linear model where
the binary variables indexed by $\mathcal{H}$ are regressed on the variables in $\mathcal{R}$, while B is the
residual variance when $\mathcal{H}$ is regressed on the variables in $\mathcal{R}$ and $\mathcal{I}$. Then, properties of
linear projections imply that we may write C = A - B, where

$$C = Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}' \left[ F_{\mathcal{I}\mathcal{I}} - F_{\mathcal{I}\mathcal{R}} F_{\mathcal{R}\mathcal{R}}^{-1} F_{\mathcal{R}\mathcal{I}} \right]^{-1} Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$$

is clearly a positive semi-definite matrix. Then Theorem 7.7.3 in Horn [11] implies that
the spectral radius of $B A^{-1}$, which is equal to that of $A^{-1} B$, is never greater than
1. The spectral radius is exactly 1 if and only if C or, equivalently, $Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$ is singular. □

Proof of Lemma 5. (i) Suppose there is a v such that $v \in \mathcal{H}$, $v \notin \mathcal{A}$, and consider
the intersections of v with the elements of $\mathcal{I} \cup \mathcal{R}$; these must belong to $\mathcal{P}$ and cannot
be contained in $\mathcal{A}$, hence they must belong to $\mathcal{R}$; the argument in the proof of Theorem
2 implies that the corresponding columns in $Q_{\mathcal{I}\mathcal{H}|\mathcal{R}}$ are 0. (ii) Under independence
$P_a G_v = \bigotimes_j \Pi_j$, where the factors $\Pi_j$ are the entries of the last row of Table 1,
$a \in (\mathcal{I} \cup \mathcal{R})$ and $v \in \mathcal{H}$. If there is a variable $j \in v$ which is not contained in any element
of $\mathcal{I} \cup \mathcal{R}$, the last entry in the fourth column of Table 1 indicates that there will be a factor
$1_j \tilde\pi_j'$, where $\tilde\pi_j'$ has $r_j$ columns; the result follows from an argument similar to the one at
the beginning of the proof of Theorem 2. □



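Table 1:

                              j ∈ a                                        j ∉ a
                 j ∈ t          j ∈ h        j ∉ (t ∪ h)     j ∈ t                 j ∈ h              j ∉ (t ∪ h)

$P_a(j)$         $I_j$          $I_j$        $I_j$           $1_j \pi_j'$          $1_j \pi_j'$       $1_j \pi_j'$

$G_{t,h}(j)$     $\tilde I_j$   $e_{jl}$     $1_j$           $\tilde I_j$          $e_{jl}$           $1_j$

$\Pi_j$          $\tilde I_j$   $e_{jl}$     $1_j$           $1_j \tilde\pi_j'$    $1_j \pi_{jl}$     $1_j$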





Proof of Lemma 6. When $\mathcal{K}(t)$ is not a singleton, by construction, the intersection
of two or more elements of $\mathcal{K}(t)$ is disjoint from h; thus $\mathcal{K}(t, h)$ is formed by sets g
with cardinality not smaller than two. Let $n_t$ be the cardinality of $\mathcal{K}(t)$; then:

$$\sum_{g \in \mathcal{K}(t,h)} (-1)^{|g|+1} = \sum_{i=2}^{n_t} \binom{n_t}{i} (-1)^{i+1} = -\sum_{i=0}^{n_t} \binom{n_t}{i} (-1)^{i} + 1 - n_t = 1 - n_t,$$

thus, point (ii) of Definition 3 is satisfied. Point (iii) is trivially satisfied because $\mathcal{I}$ is
a singleton. When $\mathcal{K}(t)$ is a singleton all the conditions of Definition 1 are trivially
satisfied. □
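
The counting step in this proof can be verified numerically. The sketch below assumes that $\mathcal{K}(t, h)$ contains every subfamily of $\mathcal{K}(t)$ with at least two elements and that a subfamily g contributes $(-1)^{|g|+1}$, so that the sum depends only on the cardinality $n_t$ of $\mathcal{K}(t)$; the function name is illustrative.

```python
from math import comb

def alternating_subfamily_sum(n_t):
    """Sum of (-1)^(i + 1) over all subfamilies of size i >= 2 of n_t sets."""
    return sum(comb(n_t, i) * (-1) ** (i + 1) for i in range(2, n_t + 1))

# Values for families of 2 to 8 sets, to compare with the closed form 1 - n_t.
values = {n_t: alternating_subfamily_sum(n_t) for n_t in range(2, 9)}
print(values)
```

Under these assumptions the sum equals $1 - n_t$, which is non zero whenever $\mathcal{K}(t)$ has at least two elements.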



Proof of Lemma 7. Under independence $P_a G_{t,h} = \bigotimes_j \Pi_j$, where $\Pi_j = P_a(j) G_{t,h}(j)$;
the possible values of $\Pi_j$ are given in Table 1, where $\tilde I_j$ is an identity
matrix without the first column, $1_j$ is a vector of ones, $e_{jl}$ a vector of 0's except for a 1
in the l-th position, and $\pi_j$ is the marginal distribution of $X_j$, all of dimension $r_j + 1$.
Point a) follows from the first two columns of Table 1, while b) follows from columns 1
and 5. □